Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features

Authors

  • Chong Chen
  • Hongfei Yan
  • Xiaoming Li
Abstract

Digital resources are complex data objects that come in a rich variety of forms and types. Their volume on the Web grows fast, but they are hard to classify efficiently. This paper presents a practical classification solution that uses features drawn from the file names and extensions of digital resources. These features are easy to obtain and common to all resources, but they are generally low-frequency and sparse, which implies that a statistical approach may not work well. Our solution combines a Naive Bayes (NB) classifier with Simple Good-Turing (SGT) probability estimation, which shows great promise under these conditions, with a total accuracy of 80%. In our opinion, the results arise because 1) the features fit NB's conditional independence hypothesis well; and 2) the abundant one-time-occurrence features lead to reasonable probability estimates for unobserved features, which also means that a general feature selection strategy is not needed in this case. A 7.4TB digital resource collection, CDAL, is used to train and
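The combination described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it pairs a Naive Bayes classifier with the basic Good-Turing estimate of unseen-feature mass (the share of tokens that occur exactly once), which is a simplification of full Simple Good-Turing because the counts of observed features are not smoothed. The tokenisation rule, the `ext:` feature prefix, the clamping of the unseen mass, and the toy file names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def features(filename):
    """Tokens from the file name plus an 'ext:' feature for the extension."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        stem, ext = filename, ""
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t]
    if ext:
        tokens.append("ext:" + ext.lower())
    return tokens


def train(samples):
    """samples: iterable of (filename, label) pairs."""
    counts = defaultdict(Counter)  # label -> token frequencies
    docs = Counter()               # label -> number of training files
    for name, label in samples:
        docs[label] += 1
        counts[label].update(features(name))
    vocab = set().union(*counts.values())
    return counts, docs, vocab


def log_prob(token, counter, vocab_size):
    """log P(token | class) with Good-Turing mass reserved for unseen tokens."""
    n = sum(counter.values())
    if n == 0:  # class with no usable tokens: fall back to a uniform guess
        return math.log(1.0 / (vocab_size + 1))
    n1 = sum(1 for c in counter.values() if c == 1)
    # Good-Turing: unseen mass is N1 / N. The clamp is a pragmatic guard
    # (an assumption of this sketch) so that neither share becomes zero.
    p0 = min(max(n1 / n, 1.0 / (n + 1)), 0.9)
    if token in counter:
        return math.log((1.0 - p0) * counter[token] / n)
    # Spread the unseen mass over types missing from this class, plus one
    # slot for tokens never seen in any class.
    unseen_types = vocab_size - len(counter) + 1
    return math.log(p0 / unseen_types)


def classify(filename, model):
    counts, docs, vocab = model
    total = sum(docs.values())
    scores = {}
    for label, counter in counts.items():
        score = math.log(docs[label] / total)  # class prior
        for tok in features(filename):
            score += log_prob(tok, counter, len(vocab))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (hypothetical, not taken from CDAL).
toy_model = train([
    ("holiday_photo_01.jpg", "image"),
    ("beach_photo.jpg", "image"),
    ("family_photo.png", "image"),
    ("lecture_notes.pdf", "document"),
    ("thesis_draft.pdf", "document"),
    ("meeting_notes.doc", "document"),
])
print(classify("summer_photo.jpg", toy_model))
```

Because most file-name tokens occur only once, the singleton share N1/N is large, so a previously unseen token such as `summer` receives a non-trivial probability instead of zeroing out the whole product. This is the intuition behind the abstract's second point.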

Related resources

Comparison of Artificial Neural Network Training Algorithms for Predicting the Weight of Kurdi Sheep using Image Processing

Extended Abstract. Introduction and Objective: Because of weakness, unwanted errors, the impact of the environment, and exposure to natural events, humans always make mistakes in their diagnoses of the environment or of different topics, so that different people's perceptions of a single, unique event may vary widely. Nowadays, with the development of image proc...

High performance of the support vector machine in classifying hyperspectral data using a limited dataset

To prospect mineral deposits at regional scale, recognition and classification of hydrothermal alteration zones using remote sensing data is a popular strategy. Due to the large number of spectral bands, classification of the hyperspectral data may be negatively affected by the Hughes phenomenon. A practical way to handle the Hughes problem is preparing a lot of training samples until the size ...

Curl Size and Pelt Color Determination of Zandi Lambs Using Image Processing and Artificial Neural Network

In this study, a method based on image processing and an artificial neural network is introduced to determine the pelt color and curl size of newborn lambs in Zandi sheep. The data were collected from 300 newborn lambs reared in the Zandi sheep breeding centre of Khojir, Tehran. First, the curl size and pelt color of the newborn lambs were recorded by experienced appraisers, and at the same time, sev...

Children's Interactive Book in Iran A Review On Existing Situation And Production Challenges

Background and objective: Audience communication with digital media is bilateral and interactive. Book publishing in non-printed and digital formats has also created the feature of interactivity in books. This paper presents an attempt to identify the challenges of creating interactive children's books in Iran by evaluating children's interactive books published in Iran and the ...

An Investigation into Digital Library Users' Collaborative Information Seeking (CIS) of Graduate Students of Kharazmi University with an emphasis on two easy and difficult scenarios

Background and Aim: Understanding collaborative information seeking behaviour requires knowing about personal characteristics, differences between users, and the types of interaction that occur during collaborative behaviour. The aim of this study is to investigate the dimensions of collaborative information seeking behaviour among graduate students of Kharazmi University when using a digital library bas...

Publication date: 2008